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1  INTRODUCTION 


1.1  WHAT  IS  PARALLEL  PROCESSING? 

The  task  of  parallel  processing  is  to  develop  a  good  strategy  to  partition  a  certain 
computational  task  into  a  set  of  subtasks  and  to  assign  each  subtask  to  a  processor  on  a 
parallel  system.  It  seems  to  be  difficult  to  design  a  single,  high-level  parallel  processing 
model  and  a  set  of  tools  that  are  efficient  for  all  applications  and/or  for  all  parallel 
systems.  For  a  given  (type  of)  application  and  a  target  parallel  system,  however,  good 
parallelism  can  be  achieved  by  minimizing  message-passing  overhead  and  maximizing 
load  balance.  Accurate,  theoretical  performance  evaluation  of  a  nontrivial  parallel  algo¬ 
rithm  running  on  a  message-passing  system  is  generally  not  easy  except  for  algorithms 
that  involve  few  interprocessor  communications.  An  upper  limit  of  the  performance  of  a 
parallel  application  can  be  found  using  the  well-known  Amdahl’s  law  (reference  1):  Sup¬ 
pose  a  computational  task  needs  a  sequential  executioh  time  To  and 


where  T,  is  the  execution  time  taken  by  the  part  of  the  computation  that  is  fully  paral- 
lelizable,  and  Tnp  is  the  part  that  is  basically  nonparallelizable.  Now,  if  we  perform  the 
computation  on  a  parallel  system  with  P  processors,  the  execution  time  is  (assuming  a 
perfect  load  balance) 


The  speed-up  using  P  processors  is 


ro_ _ 

5  +  3^’ 


(1) 


and  the  efficiency  is  defined  as 


P  5  To 

P  T,  +  PT„,' 

Although  this  model  is  oversimplified  for  analyzing  many  practical  parallel  applica¬ 
tions,  it  does  indicate  how  efficiently  one  can  use  a  multiprocessor  system.  Let  us  assume 
there  is  an  application  for  which  To  =  1  and  3J»  =  0.95 .  According  to  equation  1,  using 
10  processors  would  produce  a  speed-up  of  7  and  an  efficiency  of  70%.  But  using 
100  processors  would  only  produce  a  speed-up  of  17  and  an  efficiency  of  17%.  This 
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example  tells  us  that  a  massively  parallel  machine  can  be  used  efficiently  only  when  the 
nonparallelizable  part  of  an  application  is  very  small  compared  with  the  parallelizable 
part. 

Programming  a  distributed  memory,  multiple-instruction,  multiple-data  (MIMD)  paral¬ 
lel  computer  is  indeed  more  difficult  than  programming  a  shared  memory  or  single¬ 
instruction,  multiple-data  (SIMD)  parallel  computer.  This  difficulty  is  related  to  the  paral¬ 
lel  processing  flexibility  of  such  a  system.  TTiis  flexibility  offers  actual,  efficient 
implementations  of  various  efficient  parallel  algorithms.  Though  it  is  possible  to  write 
some  basic,  low-level  parallel  programming  tools,  it  is  generally  not  easy  to  design  an 
efficient  high-level  programming  tool  that  fits  all  applications.  For  a  parallel  application 
developer  on  such  a  system,  a  possible  way  to  efficient  implementation  of  a  parallel 
application  algorithm  is  to  use  a  conventional  high-level  programming  language  like  C  or 
Fortran  plus  some  low-level  parallel  system  library  function  calls.  One  reason  is  that 
programming  at  this  level  gives  one  the  flexibility  of  reconfiguring  in  software  the  proces¬ 
sor  network  so  as  to  make  it  fit  the  best  algorithm  one  can  come  up  with.  Another  reason 
is  that  at  this  level,  the  programmer  has  full  control  over  actual  message-passing  opera¬ 
tions.  For  fine-grain  applications,  one  may  need  to  be  as  ‘greedy’  as  possible  in  terms  of 
data  movement  to  minimize  communication  overhead,  since  otherwise  the  parallel  per¬ 
formance  may  be  consumed  by  this  overhead,  as  can  be  seen  from  Amdahl’s  law. 

1.2  PARALLEL  SIGNAL  AND  IMAGE  PROCESSING  MODELS 

Many  composite  signal  and  image  processing  algorithms  can  be  implemented  on  a 
two-dimensional  mesh  of  processors  with  different  components  of  the  algorithm  proc¬ 
essed  in  a  pipelined  fashion.  In  pipelined  processing,  for  example,  we  could  partition  the 
whole  iWtirp  array  into  several  connected  subarrays,  with  each  subarray  performing  a 
different  part  of  the  algorithm.  The  throughput  of  pipelined  processing  is  clearly  deter¬ 
mined  by  the  slowest  part  in  the  chain.  Pipelined  processing  falls  into  the  category  of 
“parallel  processing  by  function  partitioning.”  Another  popular  model  for  parallel  proc¬ 
essing  is  “data  partitioning,”  in  which  each  processor  deals  with  a  subset  of  the  data  set 
through  the  entire  algorithm.  Using  this  model,  it  is  important  to  find  a  reasonable  way  to 
partition  and  distribute  the  data  so  that  the  interprocessor  communication  overhead  is 
minimized  for  an  application  algorithm.  The  throughput  of  the  data  partition  model  is 
limited  by  the  processor(s)  with  the  largest  workload.  Therefore,  load  balancing  is  one  of 
the  crucial  factors  affecting  the  performance  using  this  model.  In  mapping  some  complex 
signal  and  image  processing  algorithms  to  a  parallel  system,  it  should  be  possible  to 
design  an  efficient  approach  using  a  combination  of  these  two  models.  The  two  computa¬ 
tional  models  are  illustrated  in  figures  1  and  2,  respectively. 
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Figure  1.  Function  partition  and  pipelined  processing. 
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Load  balancing  seems  to  be  a  less  challenging  problem  for  many  signal  and  image 
processing  applications  since,  in  many  cases,  computational  tasks  can  be  evenly  parti¬ 
tioned  (or  nearly  so)  and  assigned  to  processors  before  the  computation  starts  and  the 
computational  loads  for  different  processors  are  static  through  the  whole  computation. 
Minimizing  communication  cost  is  still  an  important  issue  to  consider  when  mapping 
signal  and  image  processing  algorithms  to  a  message-passing  parallel  system.  For  many 
pipelined  types  of  processing,  it  is  crucial  to  have  a  set  of  processors  capable  of  perform¬ 
ing  external  I/O  to  keep  up  with  the  processing  rate  within  the  parallel  system. 

Section  2  discusses  iWarp  architecture  and  programming  on  the  iWarp  system.  Sec¬ 
tion  3  discusses  programming  tools  development  on  the  iWarp  system.  Section  4  presents 
mapping  strategies  for  implementing  signal  and  image  processing  applications  onto  the 
iWarp  system  and  their  performance  results.  In  appendbc  A,  we  include  some  codes  that 
illustrate  how  to  write  an  application  code  on  the  iWarp  system:  a  head  file  showing 
necessary  include  files  and  global  declarations;  a  processor-network  set-up  function  illus¬ 
trating  setting  up  a  logical  network  and  performing  message-passing;  a  main  program 
using  our  parallel  programming  tools  to  implement  a  weighted  differencing  of  two 
images. 


2  THE  iWARP  SYSTEM 


2.1  iWAR?  SYSTEM  CHAKACTERISTICS 

The  iWarp  system  is  a  distributed  memory,  MIMD  multiprocessor  system.  Each  iWarp 
processor  consists  of  three  distinct  components:  a  computation  agent,  a  communication 
agent,  and  a  local  memory  unit.  The  computation  agent  performs  the  normal  programmed 
computation.  The  communication  agent  handles  message-passing  between  different  proc¬ 
essors.  The  local  memory  unit  provides  access  to  local  memory  for  both  the  computation 
agent  and  communication  agent.  Each  iWarp  processor  has  a  0.5-megabyte  (Mbyte)  base 
memory  that  can  be  upgraded  in  0.5-Mbyte  increments.  Each  iWarp  processor  has  an 
upper-limit  speed  (or  “peak  performance”)  of  20  MFLOPS  on  a  C-Step  chip  for  single 
precision  floating-point  operations  and  10  MFLOPS  on  double  precision  operations.  The 
peak  performance  for  a  B-step  chip  is  half  that  of  a  C-step  chip.  The  bandwidth  of  inter¬ 
processor  message-passing  is  40  Mbytes/second  in  each  of  the  four  directions.  So  roughly 
speaking,  a  minimum  of  four  single  precision  operations  are  needed  on  each  word 
(4  bytes)  of  data  being  sent  on  to  gain  through  parallelism.  iWarp  has  a  interprocessor 
communication  library  called  PathLib.  The  PathLib  implementation  offers  a  low-latency, 
high-bandwidth  interprocessor  communication.  The  iWarp  system  was  designed  for  fine- 
grain  types  of  applications,  for  example,  signal  and  image  processing. 
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2.2  MESSAGE-PASSING  ON  THE  iWARP  SYSTEM 


The  processor  network  on  the  iWarp  system  is  a  two-dimensional  toroidal  mesh,  which 
we  call  the  iWarp  array.  One  advantage  of  such  a  topology  is  that  any  processor  on  the 
array  is  no  different  from  any  other  processors  on  the  array.  This  processor  connection 
structure  makes  it  convenient  to  write  scalarable  parallel  code  and  some  important  paral¬ 
lel  program  development  tools.  For  example,  it  is  easy  to  write  an  application  code  that 
runs  on  any  rectangular  iWarp  array  and  performs  I/O  at  specified  processors. 

Message-passing  protocols  are  included  in  the  PathLib  system.  There  are  three  types 
of  modes  a  programmer  can  use  for  performing  message-passing.  In  iWarp  terminology, 
these  modes  are  called  streaming,  express,  and  spooling  (references  2  and  3).  Streaming 
provides  the  mechanism  for  fast,  ‘door-to-door’  small  message  delivery,  particularly  use¬ 
ful  for  pipeline  or  systolic  processing.  It  was  implemented  using  fast,  special  registers  as 
output  and  input  ‘gates’  so  that  a  message  does  not  need  to  go  through  local  memories  of 
sending  and  receiving  processors  during  the  message-passing  process.  Express  is  a  facility 
that  allows  each  processor  to  connect  a  pair  of  message-passing  I/O  gates  so  that  mes¬ 
sages  can  go  through  the  processor’s  communication  hardware  from/to  its  neighboring 
processors  without  affecting  the  activities  taking  place  at  this  processor.  I/O  gates  can  be 
connected  or  disconnected  at  runtime,  which  provides  a  means  for  flexible  and  efficient 
interprocessor  communication.  Spooling  allows  transmitting  a  large  message  from  a  mes¬ 
sage  buffer  at  a  sending  processor  to  the  message  buffer  at  a  receiving  processor. 
Message-passing  in  the  spooling  mode  is  asynchronous  in  the  sense  that  the  sending  proc¬ 
essor  can  start  the  next  step  of  processing  after  telling  the  processor’s  communication 
hardware  to  send  a  certain  message  buffer  to  a  destination  processor;  and  the  receiving 
processor’s  communication  hardware  will  put  the  received  messages  in  its  local  memory 
without  interfering  with  other  activities  going  on  at  the  processor.  Hence,  the  proper  use 
of  message  spooling  makes  it  possible  to  overlap  communication  and  computation.  On  the 
other  hand,  streaming  mode  message-passing  is  synchronous  in  the  sense  that  sending 
and  receiving  processors  must  coordinate  properly  during  the  message-passing.  Inconsis¬ 
tency  in  the  pair  of  processors’  actions  may  cause  the  program  execution  to  stop  there. 

2.3  UNIX-C  PROGRAMMING  INTERFACE  TO  THE  iWARP  SYSTEM 

If  we  use  a  parallel  computer  to  perform  a  computationally  intensive  and  well- 
parallelized  part  of  an  application  algorithm,  other  parts  of  the  algorithm  (e.g.,  visualiza¬ 
tion  and  some  sequential  part  of  the  algorithm)  might  better  be  executed  on  another  (may 
be  sequential)  machine.  Indeed,  there  has  been  a  growing  interest  in  the  supercomputing 
world  for  a  heterogeneous  computing  environment  in  which  parallel  systems  can  interface 
with  reasonable  efficiency  with  various  workstations  and  other  systems.  For  example,  it 
would  be  nice  to  be  able  to  perform  message-passing  between  a  program  running  on  a 
Sun  workstation  and  a  program  running  on  any  of  the  processors  on  a  parallel  system. 
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The  Intel  iWarp  system  provides  a  software  mechanism  called  the  imsg  facility  that 
allows,  at  the  application  programming  level,  the  message-passing  between  a  C  program 
running  on  a  host  workstation  and  a  C  program  running  on  the  iWarp  system.  The  way  it 
works  can  be  briefly  described  as  follows.  We  call  the  program  running  on  the  host  work¬ 
station  the  host  program  and  call  the  program  running  on  the  iWarp  array  the  cell  pro¬ 
gram.  We  first  need  to  load  the  cell  program  to  the  iWarp  array  because  the  iWarp  load 
program  will  also  start  a  controlling  process,  called  the  iWarp  daemon  process,  which  is 
needed  for  the  host  program’s  initialization  action  (e.g.,  the  host  program  needs  to  get  the 
memory  address  of  the  iWarp  controlling  process).  Then  we  can  load  the  host  program. 
The  communication  between  the  host  program  and  cell  program  is  realized  by  reading/ 
writing  from/to  a  certain  memory  address  of  the  iWarp  controlling  process.  At  the  appli¬ 
cation  programming  level,  this  facility  makes  one  think  it  is  possible  to  pass  data  from  a 
host  machine  to  any  iWarp  cell  directly.  But  the  actual  implementation  of  this  mechanism 
on  the  iWarp  is  a  token-ring  message-passing.  Therefore,  even  assuming  the  imsg  imple¬ 
mentation  itself  is  efficient,  the  imsg  facility  cannot  do  parallel  message-passing  on  the 
iWarp  array,  which  would  give  a  bad  performance  for  some  parallel  applications.  We 
think  the  imsg  facility  is  useful  when  we  need  to  have  the  host  program  control  the  execu¬ 
tions  of  cell  programs  or  when  we  need  to  pass  some  data  from  the  host  program,  which 
may  be  the  result  from  a  computation  performed  on  the  host  workstation,  to  the  iWarp 
array. 


3  iWARP  PROGRAMMING  TOOL  DEVELOPMENT 

Although  using  a  conventional  programming  language  with  some  PathLib  function 
calls  offers  great  flexibility  and  efficiency  in  mapping  various  application  algorithms  onto 
the  iWarp  system,  we  do  not  want  to  rewrite  much  of  the  high-level  communication  and 
simple  global  operation  code  in  every  application  program  development.  To  increase  the 
parallel  code  reuse,  we  have  been  writing  some  efficient  parallel  programming  tools  for 
the  iWarp  system.  These  tools  are  in  the  form  of  C  functions  and  can  be  called  to  facili¬ 
tate  parallel  code  development  on  the  iWarp  system. 

The  parallel  programming  tools  we  have  developed  include  functions  for  setting  up 
processor  networks  of  rings  (unidirectional  and  bidirectional)  and  two-dimensional 
toroidal  meshes  of  various  sizes;  data  movement  functions  for  broadcasting,  scattering, 
and  gathering  a  data  array  to  or  from  an  array  of  iWarp  processors;  a  function  for  sum¬ 
ming  up  a  set  of  vectors  or  matrices  distributed  on  an  iWarp  array  and  putting  the  result 
at  one  of  the  processors  in  the  iWarp  array;  and  a  function  for  transposing  a  two- 
dimensional  data  array  distributed  on  an  iWarp  array.  We  point  out  that  with  the  code 
that  sets  up  a  two-dimensional  toroidal  mesh,  an  embedded  bidirectional  ring  can  be 
obtained  by  properly  defining  the  message-passing  VO  handles  (for  an  explanation  of 
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these  handles,  see  references  2  and  4).  Figure  3  shows  a  ring  network  embedded  in  a 
4x4  toroidal  mesh. 
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Figure  3.  A  ring  network  embedded  in  a 
two-dimensional  network. 

For  image  processing  applications,  the  typical  data  structure  is  two-dimensional.  A 
basic  assumption  we  have  made  in  terms  of  data  partition  is  to  have  each  processor  hold 
a  few  rows  or  columns  of  the  whole  data  array.  The  scatter  function  can  be  called  to  read 
from  a  disk  file  a  two-dimensional  data  array  and  assign  a  submatrix  (by  rows  or  by 
columns)  to  each  processor  of  the  iWarp  array.  The  gather  function  can  be  called  to 
output  submatrices  distributed  on  an  iWarp  array  to  a  disk  file.  The  scatter  and  gather 
functions  never  hold  the  whole  data  array  since  we  also  assume  no  processor  would  have 
enough  local  memory  to  store  the  entire  data  array  for  an  application.  At  this  moment, 
the  performance  of  such  I/O  operations  is  restricted  by  the  relatively  low  bandwidth  be¬ 
tween  the  'Warp  system  and  external  I/O  devices.  For  efficient,  pipelined  real-time  com¬ 
puting,  e.g.,  for  systolic  processing,  we  may  need  at  least  one  row  or  column  of  proces¬ 
sors  connected  to  external  I/O  devices  to  get  enough  I/O  bandwidth.  For  some 
applications,  data  overlapping  at  different  processors  is  necessary  (e.g.,  to  form  a  window 
of  certain  size).  Therefore,  we  have  generalized  the  scatter  function  so  that  an  argument 
to  the  function  can  be  specified  for  the  number  of  overlapping  rows  (columns)  needed. 

The  global  sum  operation  is  often  needed  in  computing  on  a  distributed  memory  sys¬ 
tem.  An  efficient  implementation  of  such  a  function  is  important  to  the  performance  of 
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an  application  requiring  many  such  operations.  On  a  toroidal  mesh,  we  implemented  a 
global  sum  function  using  the  following  algorithm,  assuming  the  linear  dimension  of  the 
iWarp  subarray  used  is  a  power  of  2; 


A  Reduction  Sum  Algorithm  Using  Express  (Split-Join) 

Redefine  row  and  colunm  ID  for  this  procetsor  relative  to  the  output  processor  ID; 
/*  right  to  left  sum  */ 

Bind  output  gate  to  left  and  input  gate  to  ri{^t; 
if  column  ID  =  0 

Receive  and  add  width  -  1  huifers  to  the  local  buifer; 

Send  out  a  software-mark 
Goto  label  A; 
flag «-  1; 
while(flag  ^  0) 

if  column  ID  is  odd 
Send  out  a  buffer; 

Join  right-left  gates; 
if  a  software-mark  comes 

Receive  and  send  out  a  software-mark; 
flag  ♦-  0; 

if  colunm  ID  is  even 

Receive  and  add  a  buffer  to  the  local  buffer; 
column  ID  *-  colunm  ID  /  2; 
if  colunm  >  1,  exit; 
labd  A: ; 

/*  down  to  up  sum  */ 

Bind  output  gate  to  up  and  input  gate  to  down; 
if  column  ID  =  0 

Receive  and  add  hei|^t  -  1  buffers  to  the  local  buffer; 

Send  out  a  software-mark 
Goto  label  B; 
fl  ag  ♦-  1; 
while(flag  ^  0) 

if  colunm  ID  is  odd 
Send  out  a  buffer; 

Jean  down-up  gates; 
if  a  software-mark  comes 

Receive  and  send  out  a  software-mark; 
flag  0; 

if  colunm  ID  is  even 

Receive  and  add  a  buffer  to  the  local  buffer; 
colunm  ID  *-  colunm  ID  /  2; 
labd  B:  exit; 
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This  function  takes  six  parallel  additions  to  get  the  sum  of  a  linear  array  of  data 
distributed  on  an  8  x  8  iWarp  array,  with  the  result  put  at  one  of  the  processors  specified 
as  an  argument  to  the  function.  In  general,  the  reduction  sum  algorithm  needs 
parallel  additions  to  get  the  sum  of  N  numbers.  It  is  often  the  case  that  the  result  of  a 
global  operation  is  needed  at  all  processors  for  subsequent  computing.  This  can  be  done 
by  a  broadcast  (or  replicate)  function,  which  sends  a  copy  of  a  buffer  of  data  to  each 
processor  on  the  iWarp  arr-'y.  An  efficient  way  to  implement  a  broadcast  function  on 
iWarp  is  to  use  the  express  facility. 

Another  global  data  movement  operation  is  the  transpose  of  a  two-dimensional  data 
array  distributed  on  the  iWarp  array.  The  reason  that  this  operation  is  useful  can  be 
explained  as  follows.  Suppose  we  have  distributed  the  data  matrix  by  columns  to  the 
iWarp  array  and  we  would  like  to  do  a  left  transform  and  then  a  right  transform  on  the 
data  matrix.  In  matrix  notation,  we  want  to  compute  the  product  of  three  matrices: 


left 

data 

right 

transform 

matrix 

transform 

matrix 

matrix 

Here  we  assume  the  first  and  the  third  transform  matrix  can  be  formed  locally  at  each 
processor.  The  left  transformation  (i.e.,  finding  the  product  of  first  two  matrices)  can  be 
readily  performed  in  parallel.  To  perform  the  second  transformation  (i.e.,  find  the  prod¬ 
uct  of  the  resulting  matrix  from  the  first  transform  and  the  third  matrix)  in  parallel,  we 
need  to  transpose  the  data  matrix  since  it  is  not  partitioned  correctly  for  that  operation. 
This  example  is  applicable  to  the  two-dimensional  fast  Fourier  transform  (FFI)  algorithm 
(see  section  4.1)  and  to  the  singular  value  decomposition  of  a  matrix.  We  will  discuss 
strategies  for  implementing  a  matrix  transpose  on  the  iWarp  array  in  section  4. 

4  MAPPING  APPLICATIONS  TO  THE  iWARP  SYSTEM 

In  this  section,  we  describe  strategies  for  the  mappings  of  two-dimensional  FFT,  two- 
dimensional  convolution,  some  image  processing  and  neural  network  algorithms,  and 
some  numerical  linear  algebra  algorithms  onto  a  message-passing  parallel  system.  We 
also  present  performance  results  obtained  from  the  implementations  of  some  of  these 
algorithms  on  the  iWarp  system.  We  basically  used  the  data  partition  model  in  all  our 
discussions  and  implementations.  However,  the  function  partition  model  may  be  used  as  a 
global  strategy  when  we  want  to  map  a  composite  signal  or  image  processing  algorithm  to 
such  a  parallel  system  using  some  of  the  applications  discussed  here  as  components. 

4.1  TWO-DIMENSIONAL  FFT 

Two-dimensional  discrete  Fourier  transform  is  a  tool  frequently  used  in  signal  and 
image  processing  algorithms.  Its  mathematical  definition  is 
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H{nl,n2) 


Nj-lNi-l 

E  E 


AisO  Al=:0 


«i(nj,  fc2)tj(ni,  ki)h{ku  ifj), 


7*1  =  0  •  •  •  Ni  —  1 ,  nj  =  0  •  •  •  —  1 ; 


where  =  ezp(2xink/Ni) .  We  can  also  write  the  above  equations  into  a  matrix  equa¬ 

tion  that  is  more  instructive  for  our  purpose 


F(l,l)  ...  H{1,N2) 

•  •  t  •  • 

H{Nul)  •••  H{NuN2) 
••• 


fc(l,l)  ... 


••• 


hitfuNi) 


MJVi.l) 


h(NuN2) 


t2{N2,l) 


t2{N2,N2) 


This  matrix  equation  shows  more  clearly  what  needs  to  be  done  if  we  want  to  parallel¬ 
ize  the  row-column  transform  algorithm  for  the  two-dimensional  Fourier  transform.  Sym¬ 
bolically  (see  reference  5),  the  row-column  transform  algorithm,  e.g.,  can  be  arranged  as 


B(ni,n2)  =  FT-on-index-2(PT-on-mdex-l[h(fci,fc2)]). 


In  matrix  notation,  this  is  equivalent  to  computing  the  product  of  the  first  two  matrices 
and  then  computing  the  product  of  the  first  product  matrix  with  the  last  matrix.  Suppose 
we  let  each  processor  hold  a  few  columns  of  the  data  matrix.  The  first  matrix  multiplica¬ 
tion  can  then  be  readily  parallelized.  To  do  the  second  parallel  matrix  multiplication, 
however,  we  first  need  to  transpose  the  first  product  matrix.  Thus,  a  simple  parallel 
two-dimensional  FFT  algorithm  using  row-column  transform  is 

1.  scatter  the  two-dimensional  data  array  by  columns  to  an  iWarp  array; 

2.  perform  a  parallel  one-dimensional  FFT  on  each  processor  in  the  iWarp  ar¬ 
ray; 

3.  transpose  the  data  array  distributed  on  the  iWarp  array; 

4.  perform  another  parallel  one-dimensional  FFT  on  each  processor  in  the 
iWarp  array;  and, 

5.  collect  the  transformed  data  array  for  the  next  processing  step. 

Now,  if  we  have  an  efficient  one-dimensional  FFT  code  running  on  each  processor  of 
a  parallel  system,  the  efficiency  of  the  matrix  transpose  operation  clearly  has  a  great 
influence  on  the  performance  of  the  two-dimensional  FFT  code.  In  matrix  transpose 
operation,  each  processor  needs  to  send  some  data  to  each  of  the  other  processors  in  the 
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network.  An  efficient  implementation  of  the  matrix  transpose  operation  should  use  as 
many  available  communication  paths  as  possible.  Figures  4  and  5  show  one  way  of  per¬ 
forming  the  matrix  transpose  on  a  mesh  connection  of  processors. 

This  matrix  transpose  strategy  basically  consists  of  two  steps.  The  first  step  performs 
data  transfer  between  columns  of  processors.  In  the  first  step,  for  example,  proces¬ 
sor  (1,1)  will  send  to  processor (1,2)  all  the  data  needed  by  processors  in  the  second  col¬ 
umn.  Thus  processor(l,2)  needs  a  temporary  buffer  to  store  the  received  data  for  proces- 
sors(i,2)  for  t  =  2,  •  •  • ,  height-of-network.  In  the  case  of  each  processor  containing  more 
than  one  row  of  data,  a  local  transpose  is  also  performed  at  each  processor  in  the  first 
step.  The  second  step  involves  data  transfer  between  rows  of  processors.  In  the  second 
step,  for  example,  processor(l,l)  will  send  data  stored  in  its  temporary  buffer  to  proces- 
sors(i,l)  for « =  2,  •  •  • ,  height-of-network.  Using  this  matrix  transpose  strategy,  the  number 
of  pathways  used  simultaneously  on  a  square  iWarp  array  is  Vm,  where  M  is  the  number 
of  processors  in  the  iWarp  array.  A  major  drawback  of  this  strategy  is  the  need  to  store 
temporary  data  on  each  processor,  which  is  unfavorable  if  each  processor  has  a  limited 
amount  of  local  memory  and  the  data  size  of  the  application  is  large. 

Another  way  to  implement  a  matrix  transpose  is  illustrated  in  figures  6  and  7.  This 
matrix  transpose  strategy  also  consists  of  two  steps.  In  the  first  step,  each  column  of 
processors  sends  data  needed  by  processors  on  the  same  row  but  different  columns.  In  the 
second  step,  all  processors  but  one  on  a  certain  row  send  data  to  a  different  row  of 
processors;  the  remaining  processor  does  a  local  transpose;  this  pattern  is  repeated  for  all 
processors  on  the  row.  This  transpose  strategy  uses  pathways  simultaneously. 

The  strategy  may  be  more  efficient  since  all  data  transfer  can  be  implemented  from  the 
source  to  the  final  destination  without  using  extra  local  memory. 

We  also  point  out  that  operation  count  for  the  row-column  two-dimensional  FFT  is 
0(2n’log3fi).  The  row-column  parallel  algorithms  discussed  above  have  an  operation 
count  0(2n’ log,  n/P),  where  P  is  the  number  of  processors. 

We  measured  performances  of  the  matrix  transpose  operation  and  the  first  two- 
dimensional  FFT  algorithm  from  our  implementations.  Performances  on  B-Step  and 
C-Step  processors  are  compared.  Tables  1  through  3  show  the  execution  time  for  a  com¬ 
plex  matrix  transpose  operation  for  three  different  sizes  of  matrices.  The  first  perform¬ 
ance  data  in  table  3  are  missing  because  the  execution  could  not  complete  in  that  case. 
We  guess  it  may  be  that  the  data  size  required  for  each  processor  exceeds  the  memory 
available  for  an  application  progrjim  on  each  processor.  Figures  8  and  9  and  table  4  give 
the  performance  results  for  two-dimensional  FFT  computations. 
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This  column  of  processors  does  a  local  transpose, 

sends  data  needed  by  all  rows  of  processors  to 
processors  at  the  same  row,  different  columns. 
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Figure  4.  Step  1:  Data  exchange  between 
columns  of  processors. 
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This  row  of  processors 
sends  data  to  processors 
at  the  same  column,  different 
rows. 
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Figure  5.  Step  2:  Data  exchange  between  rows 
of  processors. 
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This  column  of  processors  sends  data 
to  processors  at  the  same  row,  different 
columns. 
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Figure  6.  Step  1:  Data  exchange  between 
columns  of  processors. 


Figure  7.  Step  2:  Data  exchange  between  rows 
of  processors. 
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Table  1.  Performance  of  matrix  transpose  (128  x  128). 


COMPLEX  MATRIX  TRANSPOSE  OF  128  x  128 

#  PROCESSORS 

4 

16 

32 

64 

TIME  (ms)  (B-STEP) 

30 

13 

9 

7 

TIME  (ms)  (C-STEP) 

15 

6 

4 

3 

Table  2.  Performance  of  matrix  transpose  (256  x  256). 


COMPLEX  MATRIX  TRANSPOSE  OF  256  x  256 

»  PROCESSORS 

4 

16 

32 

64 

TIME  (ms)  (B-STEP) 

125 

52 

40 

29 

TIME  (ms)  (C-STEP) 

63 

25 

18 

14 

Table  3.  Performance  of  matrix  transpose  (512  x  512). 


COMPLEX  MATRIX  TRANSPOSE  OF  512  x  512 

»  PROCESSORS 

4 

16 

32 

64 

TIME  (ms)  (B-STEP) 

*** 

232 

162 

121 

TIME  (ms)  (C-STEP) 

167 

79 

55 

14 


Execution  Time  in  Seconds 


Figure  8.  Performance  of  two-dimensional  FFT  (128  x  128). 
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Figure  9.  Performance  of  two-dimensional  FFT  (256  x  256). 
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Table  4.  Performance  of  two-dimensional  FFT 
(512  X  512). 


TWO-DIMENSIONAL  FFT 

512  X  512 

#  PROCESSORS 

16 

64 

TIME  (sec)  (B-STEP) 

4.814 

0.923 

TIME  (sec)  (C-STEP) 

2.322 

0.403 

It  is  interesting  to  see  from  these  two-dimensional  FFT  performances  that  the  execu¬ 
tion  time  is  more  than  doubled  when  the  number  of  processors  used  is  just  halved.  This  is 
because  the  transpose  time  increases  as  the  number  of  processors  used  decreases.  It  can 
be  seen  from  tables  2  and  3  that  on  an  8  x  8  iWarp  array,  the  communication  time  (i.e., 
the  matrix  transpose)  takes  about  13%  of  the  total  execution  time  for  the  two-dimensional 
FFT  of  data  size  512  x  512,  but  the  communication  takes  about  20%  of  total  execution 
time  for  a  data  size  of  256  x  256.  The  ratio  of  communication  to  computation  increases 
as  the  the  data  size  decreases.  We  also  note  that  the  actual  performances  on  C-step 
processors,  using  the  compiler  designed  for  the  B-step  processor,  for  both  matrix  trans¬ 
pose  and  two-dimensional  FFT,  are  twice  as  fast  as  on  B-step  processors.  The  actual 
performance  on  a  C-step  processor  is  expected  to  improve  significantly  when  the  new 
compiler  for  that  chip  (not  available  to  us  now)  is  used.  Using  an  8  x  8  array  of  C-step 
processors,  the  achieved  performances  of  two-dimensional  FFT  is  about  57  MFLOPS  for 
a  128  X  128  complex  data  array,  45  MFLOPS  for  a  256  x  256  complex  data  array  and 
43  MFLOPS  for  a  512  x  512  complex  data  array. 


4.2  TWO-DIMENSIONAL  CONVOLUTION 


Convolution  is  a  fundamental  operation  in  signal  and  image  processing  since  it  is  a 
filtering  operation.  Hence,  a  fast  two-dimensional  convolution  operation  is  important  to 
speed  up  many  image  processing  applications.  The  mathematical  definition  of  the  discrete 
two-dimensional  convolution  is 

00  00 

y(nl,na)s=  51  *(**i**2)M”i  “  “  ^a)»  (2) 

Ais.— oo  Ai*— wo 
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where  we  assume  h(, )  is  the  convolution  kernel.  In  typical  cases,  we  can  also  assume 

^  0  only  for  0  <  ni,nj  <  JV  - 1, 

Hniftii)  ji  0  only  for  0  <  ni.nj  <  Jlf  -  1, 

and  N  »  M.  With  the  above  given  supports  for  input  and  kernel  functions,  it  is  seen  that 

y(nii«2)  ^  0  only  for  0  <  ni.tia  <  +  Af  - 1  , 

It  can  be  shown  that  (see  reference  6)  the  two-dimensional  convolution  using  equa¬ 
tion  2  has  an  operation  count  O  {{N  +  M- 2)*JI/*) .  If  we  assume  the  convolution  opera¬ 
tion  is  an  intermediate  step  of  a  larger  (parallel)  processing  algorithm  and  that  the  data 
are  partitioned  by  rows  consecutively  (here  we  assume  the  first  variable  of  is  row 

index),  then  an  obvious  way  to  do  a  parallel  two-dimensional  convolution  is  the  following: 

1.  Broadcast  or  compute  locally  the  kernel  function  A(ni,n2) ; 

2.  Each  processor  computes  for»*i  that  falls  in  the  range  of  the  input 

data  x(ni,na)  and  all  na. 

Since  there  are  JV'  +  Af  -  2  nonzero  output  rows,  the  processor  that  holds  the  last  few 
rows  of  input  data  will  have  M  -1  more  output  rows  to  compute.  Since  M  «  N  and 
there  should  be  fewer  computations  for  the  last  few  output  rows  due  to  the  fact  that 
x(ni,n2)  =  0  for  ni  >  AT,  workload  is  largely  balanced  in  this  case.  Since  computing  out¬ 
put  data  points  at  the  input  data  partition  boundary  will  need  input  data  rows  outside  the 
boundary,  proper  input  data  overlapping  is  necessary,  which  means  the  data  transfer 
operation  between  neighboring  processors  (in  terms  of  data  partition,  not  physical  loca¬ 
tion  of  processors)  must  be  performed  before  the  parallel  convolution  starts.  The  opera¬ 
tion  count  for  the  parallel  convolution  is  approximately  O  ((iV  +  -  2)*Jlf*)  /P,  where  P 

is  the  number  of  processors. 

If  the  kernel  function  is  separable,  we  can  write 

00  00 

y(»»i>na)=  2-»  51  *(*x»^sa)M”a  “  *a)- 

li|s-ee  h$=—oo 

Now  if  we  first  compute 

/(*i.»*a)=  53  »(*x**aW»3  -  *a) 

Ags— 00 

for  all  ki ,  and  for  which  /(kjfih)  is  nonzero,  then  compute 

00 

y(ni,na)= 

Ais— oe 
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for  all  nonzero  output  data  points;  the  total  operation  count  from  these  two  steps 
is  approximately  (see  reference  6)  NM(N  +  M  +  M  -l)* ,  which  is  about  a 

factor  of  ■M’/2  less  than  that  for  a  nonseparable  kernel.  It  is  not  difficult  to  see 
that  the  reduction  in  operation  count  for  a  separable  kernel  can  be  achieved  in  the  parallel 
implementation  discussed  above  in  essentially  the  same  way  as  in  the  sequential 
implementation.  The  operation  count  for  the  parallel  implementation  is 
O  {NM{N  +  M  -  1)/P  +  M{N  +  Af  -  1)’/P) ,  where  P  is  the  number  of  processors. 


4.3  NUMEKICAL  LINEAR  ALGEBRA  ALGORITHMS 


In  this  section,  we  present  several  basic,  parallel  numerical  linear  algebra  algorithms 
and  their  implementations  on  a  message-passing  parallel  system.  These  linear  algebra 
tools  are  frequently  used  in  many  signal  and  image  processing  algorithms.  Most  linear 
algebra  algorithms  belong  to  the  category  of  fine-grain  applications,  which  means  the 
ratio  of  computations  and  communications  is  not  very  high.  For  such  applications,  mini¬ 
mizing  communication  overhead  is  crucial  to  get  a  good  parallel  performance. 

Broadly  speaking,  linear  algebra  algorithms  can  be  classified  into  two  types,  one  for 
solving  a  linear  system  of  equations,  the  other  for  finding  eigenvalues  of  a  matrix  under 
various  assumptions.  The  computation  could  become  very  expensive  when  the  problem 
size  (or  the  matrix)  is  large  or  when  it  is  necessary  to  solve  a  certain  problem  many  times. 
For  real-time  applications,  the  problem  size  may  not  be  very  large,  but  the  solution  to  a 
problem  must  be  obtained  as  quickly  as  possible.  All  these  situations  require  us  to  have 
an  efficient  way  of  problem  solving.  Penallel  processing  is  a  popular  choice  in  that  regard. 

Before  constructing  a  parallel  algorithm  for  mapping  to  a  parallel  system,  we  may 
want  to  think  about  possible  ways  of  data  distribution  on  a  network  of  processors.  Under 
the  assumption  that  we  want  to  distribute  a  two-dimensional  data  array  by  rows  or  by 
columns,  there  are  two  common  data  distribution  strategies.  The  first  strategy  is  to  have 
each  processor  hold  a  few  consecutive  rows/columns  of  data.  The  second  strategy  is  to 
distribute  rows/columns  of  data  in  a  round-robin  way.  Here  is  an  example  of  the  second 
data  distribution  strategy.  Suppose  we  want  to  partition  a  matrix  with  eight  columns 
among  four  processors.  The  round-robin  way  of  data  distribution  will  put  first  and  fifth 
columns  in  the  first  processor,  the  second  and  sixth  colunms  in  the  second  processor,  the 
third  and  seventh  colunms  in  the  third  processor,  and  the  fourth  and  eighth  columns  in 
the  fourth  processor. 

We  now  present  these  parallel  algorithms  in  a  pseudo-code  form.  We  also  discuss 
implementation  issues  and  give  performance  results  on  the  iWarp  system.  More  detailed 
explanations  of  some  of  these  algorithms  can  be  found  in  references  7,  8,  9,  and  10. 
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Performance  results  for  these  linear  algebra  algorithms  were  obtained  from  B-Step  proc¬ 
essors.  We  expect  the  performance  on  C-step  processors  would  be  doubled  using  the 
current  compiler,  as  has  been  verified  on  the  two-dimensional  FFT  code. 

Suppose  we  need  to  solve  a  symmetric  linear  system  and  the  coefficient  matrix  is 
positive  definite;  Cholesky  algorithm  can  then  be  used.  Here  is  a  parallel  Cholesky  algo¬ 
rithm  that  can  be  implemented  on  a  ring  network  of  processors  (reference  7).  We  imple¬ 
mented  this  algorithm  using  express  commimication  with  a  ring  connection  of  processors 
on  the  iWarp  system.  The  performance  on  a  512  x  512  data  matrix  was  measured  and  is 
shown  in  figure  10. 


A  Ring  Parallel  Cholesky  Algorithm 


a  ^  0,  j  ^  0,  ji  ^  1; 

loti  l)p; 

wUle  j  ^  last 

i  ♦“  «  + 1; 

Generate  G(j  :  n,j)  in  fljoe(i  : «)  and  copy  it  into  RtocU  :  n»ii); 
U*<k 

aend(jjoe(i : 

ttpdoie  + 1 :  i); 

a  *-4  +  1; 
end 

ji  =  + 1; 

elae 

r«ce»v«(yipc(a  +  1 :  k,Uft)\ 
if  a  +  1  €  {right,  right  +  p,  •  •  • ,  right  -!-(*-  l)p} 
aend(ji«e(a  +  1  ;  n),right); 
end 

a«- J-H; 

update  Rioci’.Ji  ■  *); 
end 
end 


After  matrix  factorization  (an  LU  algorithm  can  be  used  for  nonsymmetric  systems), 
we  need  to  solve  some  triangular  linear  systems  to  get  the  solution  to  the  linear  system. 
Here  is  a  parallel  program  for  implementing  a  triangular  system  solver.  This  algorithm  is 
called  a  scalar-sum,  fan-in  algorithm  in  reference  10.  The  result  of  the 
fan  -  in(iuffer,pid)  operation  is  that  all  processors  send  out  a  number  stored  in  buffer, 
and  the  processor  with  ID  equal  to  P*<i  gets  the  sum  of  those  numbers,  including  its  own 
part.  Map(i)  is  the  processor  ID  that  contains  the  »th  column  of  the  matrix.  How  the 
fan-in  operation  (which  is  a  global  sum  operation)  is  carried  out  depends  on  the  processor 
network. 
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Figure  10.  Performance  of  Cholesky  factorization. 
A  “Fan'In”  Parallel  Triangular  System  Algorithm 


for  i  s  1 :  n 
s^O 

tor  j  —  1  :i-l 

if  j€  {rid,rid+  p,  -  ’  -  ,rid+  (k  -  l)p} 

j  =  •  +  Lijxj; 

end 

I  =  faii'm(i,  ini9(i)) 

If  j  €  {r»d,r«d+p,--',r«d+(fc-l)p} 

*i  =  (fti -«)/!« 

end 

The  above  triangular  system  solver  algorithm  can  be  implemented  on  various  types  of 
processor  networks.  We  tried  ring  and  two-dimensional  mesh  implementations.  In  the  ring 
implementation,  we  just  accumulate  the  sum  one  processor  at  a  time  along  the  ring.  In  the 
mesh  implementation,  we  use  the  reduction  sum  tool  to  do  the  fan-in  operation.  We  found 
that  the  mesh  implementation  is  30%  faster.  Figures  11  and  12  show  the  performance 
results  for  the  mesh  implementation. 
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Execution  Time  in  Milliseconds  Execution  Time  in  Milliseconds 


Figure  11.  Performance  of  solving  a  linear  triangular  system  (256  x  256). 


Figure  12.  Performance  of  solving  a  linear  triangular  system  (512  x  512). 
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It  can  be  seen  that  the  performance  does  not  improve  much  as  the  number  of  proces¬ 
sors  used  exceeds  16.  This  is  somehow  expected  since  the  algorithm  we  used  still  contains 
a  substantial  sequential  part.  Another  parallel  algorithm  for  solving  a  triangular  linear 
system,  called  the  wave-front  algorithm  in  reference  10,  is  also  discussed  in  reference  10, 
Assuming  a  column-oriented  data  partition,  here  is  the  pseudo-code  for  an  upper- 
triangular  system: 


A  Wavefront  Parallel  Triangular  System  Algorithm 


for  j  €  mycoU 

for  Jb  =  1  to  ^segments 
ifi!  =  l  receive  segment 
if  ft  =  1  then 

“  i^3  ~  ^3)1  ^'33 
segment  =  segment  -  {zj} 

for  Zj  €  segment  z,  = 
if  \segment\  >  0  then 

send  segment  to  processor  map{j  +  1). 


In  the  above  wavefront  algorithm,  each  column  of  the  upper  triangular  matrix  is 
divided  into  a  number  of  segments.  Each  segment  has  a  length  tr  of  components.  Profes¬ 
sor  map(l)  first  computes  »i ,  then  proceeds  to  compute  the  components  of  the 

first  segment  vector  *  for  the  first  column.  Once  computing  z  is  done,  processor  map(l) 
sends  z  to  processor  map(2)  so  that  the  latter  can  compute  *3  and  then  begin  further 
updates  of  * .  Meanwhile,  processor  map(l)  has  resumed  work  on  the  next  segment  of  the 
first  column.  A  more  detailed  explanation  of  this  algorithm  can  be  found  in  reference  10. 
Intuitively,  it  may  perform  better  than  the  fan-in  algorithm,  since  by  controlling  the 
parameter  ,  it  is  possible  to  make  all  processors  busy  most  of  the  time.  We  think  a  good 
choice  of  the  parameter  <r  is  to  let  it  equal  the  number  of  processors.  We  have  not 
implemented  the  wavefront  algorithm  at  this  moment.  But  the  idea  of  this  algorithm  is 
very  interesting,  and  it  could  be  useful  in  parallelizing  other  applications. 

Orthogonal  factorization  of  a  matrix  is  a  useful  tool.  For  example,  a  regular  QR  fac¬ 
torization  on  a  matrix  with  full  column  rank  is  a  numerically  stable  and  computationally 
efficient  approach  to  computing  a  solution  to  least  square  problems.  QR  factorization  can 
also  be  used  to  find  eigenvalues  of  a  matrix.  We  now  list  a  parallel  QR  algorithm  using 
Householder  transformations  on  a  ring  network  of  processors. 
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A  Ring  ParaUel  QR  Algorithm 


last «-  »  +  (ib  -  l)p; 
neztjrid  *-  (*  +  1)  mod(j>); 

neat JiMt  «-  nextjrid  +  (i  *-  l)p; 
while  e  <  n 

if  4  6  {t,*  +  p,  •••,la4t} 

v(i  :  m)  *-  hou8e(Aioe(i  :  ni,  j)); 

Store  v{j  :  m)  into  Q{j  :  tn,  j); 
send(v{j :  m)tright)] 

A<oe(i  :  n*,  j  :  k)) 

*-  row.faouie(A{oe(i  :  m,i  :  k)),v{j  :  m)); 
s*-s+l,j*-j  +  V, 
else 

recett;e(v(j  :  m),/e/t); 
if  4  ^  {nextjrid, neztjrid + p, ** * , next Ja4t} 
send(v(j :  m),  right); 
end 

Store  v(j  :  m)  into  Q{j  :  m,  j); 
if  4  <  last 

Aioe(i  :  mj  :  k) 

row.hou8e(A{oe(i  :  m,j  :  k)),v{j :  m)); 

end 

< + 1; 

end 

end 

The  performance  results  for  QR  factorization  are  given  in  figures  13  and  14.  We  can 
see  that  the  ptirallel  performance  for  this  matrix  factorization  is  much  better  than  that  of 
solving  a  trianguleir  system.  This  is  unfortunate  for  applications  where  one  needs  to  solve 
a  linear  system  with  one  coefficient  matrix  and  many  right-hand-side  vectors. 

These  linear  algebra  algorithm  tools  have  been  used  to  implement  an  acoustic  signal 
processing  algorithm.  Specifically,  we  picked  the  Minimum  Variance  Distortionless 
Response  (MVDR)  beamformer  computation  as  a  test  case.  In  MVDR  beamforming  (ref¬ 
erence  11),  one  needs  to  compute  the  following  quantity  (called  beam  power) 


where  Re  =  XX\  X  is  a  sensor  output  data  matrix  here,  and  we  assume  X  has  fall 
column  rank.  One  way  to  compute  Rmvdr  is  to  perform  a  QR  factorization  on  X  first, 
then  we  can  write  Puydr  as 

Pmvdr  =  ((i^-*g'£?)'(J^“*g's))-^ 


23 


Execution  Time  in  seconds  Execution  Time  in  seconds 


24 


where  Q  and  R  are  the  matrix  factors  from  QR  factorization  of  X.  Thus,  the  MVDR 
computation  using  QR  consists  of  a  QR  factorization,  a  matrix-vector  multiplication,  and 
a  solution  of  a  linear  upper  triangular  system.  We  do  not  give  further  details  about  the 
MVDR  implementation  in  this  report  (these  can  be  found  in  reference  7),  but  give  the 
performance  results  for  a  512  x  512  data  matrix  using  different  numbers  of  processors  in 
figure  15. 
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Figure  15.  Performance  of  a  MVDR  beamformer. 

For  a  rank-deficient  matrix,  singular  value  decomposition  (SVD)  of  the  matrix  is  a 
better  approach  for  finding  the  minimum  norm  solution  to  the  least  square  problem,  (see 
reference  8).  SVD  also  provides  the  option  of  a  reduced  rank  approximation  in  the  case 
of  an  ill-conditioned  matrix,  which  is  accomplished  by  zeroing  the  singular  values  that  are 
smaller  than  a  prescribed  threshold.  Again  the  programming  tools  we  have  developed, 
including  the  matrix  transpose  function,  can  be  used  to  implement  a  parallel  SVD  algo¬ 
rithm  on  a  distributed  memory,  message-passing  system.  The  SVD  theorem  states:  if  A  is 
a  real,  m  x  n  matrix,  then  there  exist  orthogonal  matrices 


such  that 


U^AV  =  diag{ai,  •  •  • ,  Op)  6  R"**" 
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where  •  •  •  >  0 .  There  are  several  ways  to  find  the  SVD  factorization  of  a  matrix; 

some  of  them  need  to  form  the  matrix  A’^A ,  which  could  give  rise  to  an  ill-conditioned 
matrix  to  work  on.  We  now  describe  a  parallelization  strategy  for  a  SVD  algorithm  pro¬ 
posed  in  reference  8.  This  SVD  algorithm  first  transforms  the  matrix  into  a  bidiagonal 
form 


6i  • *  •  0 

aj  *  • .  : 

•  •  « 

•• 

*  *  ■ 

••  in-l 

0  On  . 

Then  some  techniques  similar  to  the  symmetric  QR  algorithm  also  discussed  in  refer¬ 
ence  8  are  used  to  diagonalize  the  matrix  B.  This  sequential  SVD  algorithm  is  listed 
below. 


A  Singular  Value  Decomposition  Algorithm 

Use  HouBeholder  matrices  to  bidiagonalize  the  input  m  x  n  (m  <  n)  matrix  A: 

where  Ui  is  n  x  m,  is  n  X  n  and  B  is  n  x  n; 
until  q^n 

For  any  *  =  1  :  n  -  1 

Set  ai,i+i  to  zero  if  |a,v+i|  <  +  K-n.<+i|); 

Find  the  largest  q  and  the  smallest  p  such  that  if 


5  = 


flu  0  0 

0  S22  0 

0  0  fl33 


(where  flu  is  p  x  p,  fljj  is  (n  -  p  -  g)  x  (n  -  p  -  q)  and  flss  is  g  x  g) 
then  Baa  i>  diagonal  and  Baa  has  no  nonzero  superdiagonal; 
if  g  <  n 

if  any  diagonal  entry  in  Baa  is  sero, 

Zero  the  snperdiagonal  entry  in  the  same  row, 
else 

Apply  a  symmetric  QR  step  to  fljj, 

end 

end 

nod 

The  bidiagonalization  step  of  this  algorithm  basically  has  an  operational  count  of  twice 
that  of  QR  factorization  when  man  and  is  therefore  0(n*) .  The  second  step  of  this 
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algorithm  works  on  a  bidiagonal  matrix  to  annihilate  superdiagonal  elements  through  an 
iteration  process.  If  we  need  to  do  a  SVD  on  a  matrix  in  a  signal  or  image  processing 
algorithm,  under  the  assumption  that  the  data  matrix  is  distributed  in,  say,  columns,  the 
first  step  can  be  parallelized  by  multiplying  a  transformation  matrix  to  -A  from  left,  trans¬ 
posing  the  resulting  matrix,  multiplying  a  transformation  matrix  to  A.  from  right,  trans¬ 
posing  the  resulting  matrix  again,  and  so  on.  Since  at  each  step  we  are  working  on  a 
matrbc  of  smaller  size,  the  size  of  the  submatrbc  to  be  transposed  may  also  decrease.  This 
will  require  some  additional  decision  code  in  a  matrix  transpose  program.  The  second 
step  of  the  algorithm  is  perhaps  even  more  communication  intensive  and  it  is  probably 
efficient  to  do  it  in  a  single  processor.  In  summary,  SVD  is  a  fine-grain  computation  and 
it  contains  some  sequential  parts  that  seem  to  make  it  harder  to  get  a  good  parallel 
performance. 

4.4  THE  IMPLEMENTATION  OF  AN  IMAGE  WEIGHTED  FRAME- 
DIFFERENCING  SCHEME 

Using  our  parallel  programming  tools  described  in  section  3,  we  implemented  a  multi- 
spectral,  weighted  frame-differencing  scheme  on  the  iWarp  array.  Given  a  set  of  images 
A(i),i  =  1, •••,!»,  we  would  like  to  compute  cross-covariances  of  pairs  of  images  and  out¬ 
put  images  of  the  form 

=  -^(0  “  P{h3)A{j) 

where  is  the  cross-covaritmce  between  A(i)  and  A(j).  This  operation  is  often  used 
as  one  of  the  steps  of  a  detection  algorithm  for  identifying  small  targets  embedded  in 
background  clutter,  under  the  assumption  that  background  clutter  is  statistically  correlated 
between  different  image  freunes  and  targets  are  not.  A  pseudo-program  for  performing 
this  is  as  follows; 


A  Parallel  Weighted  Differencing  Program 

hiput  and  scatter  a  set  of  images  by  rows  to  the  iWarp  array; 

Normalize  each  image  by  rows  at  eadi  processor; 

Compute  local  sum  of  pixel  values  of  images; 

Compute  global  sum  of  pixel  values  of  images; 

Compute  means  of  pixd  values  of  images  at  one  processor; 

Broadcast  the  means  to  all  processors; 

Compute  partial  cross-covariances  of  each  pair  of  images  at  eadi  processor; 
Using  the  ^obal  sum  to  get  complete  cross-covariances; 

Broadcast  the  cross-covariances  to  all  pro^sors; 

Cmnpute  wei^ted  frame-differences  at  each  processor; 

Gather  and  output  differenced  images  from  the  iWarp  array. 
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Except  for  distributing  and  gathering  the  data  array,  the  weighted  frame-differencing 
operation  is  highly  parallelizable,  an  example  of  a  coarse-grain  application.  In  figure  16, 
we  display  a  performance  curve  of  execution  time  versus  the  number  of  processors  used, 
where  the  execution  time  is  measured  after  data  scattering  and  the  weighted  difference  is 
performed  on  two  images  of  size  128  x  128.  The  performance  was  measured  on  the 
C-step  array.  Almost  linear  speed-up  is  achieved,  as  it  should  be  expected  for  a  typical 
coarse-grain  application. 
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Figure  16.  Performance  of  the  weighted  differencing  program. 


4.5  ON  PARALLELIZATION  OF  A  TWO-DIMENSIONAL  ADAPTIVE  LMS 
ALGORITHM 

The  two-dimensional  adaptive  LMS  algorithm  (TDLMS),  as  described  in  reference  12, 
is  an  optimal  filtering  algorithm  with  respect  to  the  mean  squared  error  measure.  The 
algorithm  can  be  used  for  image  enhancement  and  is  said  to  to  be  an  improvement  to 
Wiener  filtering  when  applied  to  nonstationary  image  signals.  The  TDLMS  algorithm  as 
described  in  reference  12  is  inherently  sequential.  We  now  propose  a  modification  to  that 
algorithm  to  make  it  possible  for  an  efficient  parallel  implementation.  We  now  give  a 
brief  description  of  the  TDLMS  algorithm  and  show  how  it  may  be  modified  for  a  possi¬ 
ble  parallel  implementation  on  the  iWarp  system. 
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Suppose  we  have  two  input  images,  each  of  dimension  M  by  M.  We  call  the  first  image 
the  desired  image,  D,  and  the  second  image  the  reference  image,  X.  We  want  to  filter  the 
reference  image  by  the  operation 


N-lS-l 

y(m,n)=  E  E 

i=0  i=!0 

where  i  is  a  pixel  index  with  i  =  ^nM  +  n .  The  error  signal  is  defined  as 


N-lN-l 

=  2)(m,n)  -  y(m,n)  =  £>(m,n) “EE  W’,(i,i)X(m-  l,n-k). 

t=0  3=0 

for  the  ith  iteration.  The  aim  is  to  obtain  a  set  of  weights  W(i,j)  (which  can  be  seen  as  a 
N  by  N  matrix)  so  that  the  quantity 

Mean-Sqaared-Error  =  E(e|) 

is  minimized  for  all  i .  Wiener  filtering  does,  this  optimization  by  evaluating  the  mean- 
squared-error  J5(eJ)  and  finding  the  optimal  weight  matrix  through  solving  a  linear  sys¬ 
tem  of  equations  with  an  autocorrelation  matrix  as  the  coefficient  matrix,  under  the  as¬ 
sumption  of  wide-sense  stationarity.  The  adaptive  LMS  algorithm  described  in  refer¬ 
ence  12  proceeds  as  follows.  Using  the  steepest  decent  idea,  one  can  modify  the  weight 
matrix  recursively: 


Wj+i  -  Wj  -  vGj, 

where  v  is  an  adjustable  parameter  used  to  control  the  convergence  speed  and  Gj  is  the 
gradient  matrix  of  E{e))  with  respect  to  Wj  : 


Gj(m,n)  = 


dW(m,ny 


For  practical  application,  one  may  use  in  place  of  its  statistical  average.  Therefore, 
the  derivatives  Gj(m,n)  can  be  computed  using  equation  3  and  we  can  get  an  explicit 
equation  for  updating  the  weight  matrix: 


Wj+i(I,  i)  =  Wj-h  2uejXim  -l,n-k). 


It  is  hoped  that  this  iterative  algorithm  will  converge  to  a  weight  matrix  that  minimizes  the 
error  «i  for  all  i . 

Since  the  above  adaptive  algorithm  iterates  on  the  pixel  index  j,  it  is  difficult  to 
parallelize  its  computation.  However,  since  the  goal  is  to  minimize  ^j  for  all  j ,  we  argue 
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that  the  sequential  updating  order  for  the  weight  matrix  is  not  necessarily  the  best  one. 
There  are  other  ways  to  update  the  weight  matrix  that  may  be  more  efficient.  For  exam¬ 
ple,  we  can  first  compute  the  errors  for  all  3  and  then  average  them.  The  updating 
equation  for  the  weight  matrix  thus  becomes 


S  (eiJr  (n.  -  I,  n  -  t)) . 
i=o 


(4) 


Using  the  updating  equation  4,  we  actually  replace  the  ensemble  average  defined  by 
E{ej)  with  the  spatial  average,  hi  addition,  this  updating  operation  can  be  parallelized 
easily.  Here  is  a  pseudo-program  for  a  parallel  implementation  of  the  adaptive  LMS  algo¬ 
rithm: 

Distribute  input  images  by  rows  (or  by  cduznns)  to  all  processors; 

/*  some  overlapping  of  rows  of  data  is  needed  */ 

Compute  and  broadcast  the  initial  wei^t  matrix; 

Each  processor  computes  error  Cj  for  indices  j  belonging  to  this  processor; 

Each  processor  computes  a  local  sum  of  ej; 

Do  a  global  sum  of  ej  and  put  the  result  in,  say,  processor  0; 

Processor  0  computes  the  average  of  Cj  and  iQ>date8  the  weight  matrix  W] 

If  the  average  error  is  less  than  a  threshold; 

processor  0  output  wei^t  matrix,  all  processors  stop  execution; 

Xlae 

Processor  0  broadcast  weight  matrix  to  all  processors, 
all  processors  repeat  from  step  3. 


4.6  ON  A  PARALLEL  NEURAL  NETWORK  TRAINING 

The  neural  net  has  become  a  useful  tool  in  many  signal  processing  and  pattern  recog¬ 
nition  applications.  We  now  describe  the  use  of  our  programming  tools  to  implement  a 
backpropagation  (BP)  neural  network  training.  A  BP  neural  net  is  a  mapping  network  that 
can  approximate  a  multidimensional  function.  A  detailed  discussion  on  this  network  can 
be  found  in  many  books  about  neural  networks,  e.g.,  reference  13.  Given  a  set  of  training 
data  consisting  of  pairs  of  desired  inputs  and  outputs,  the  task  of  training  the  BP  neural 
net  is  to  adjust  the  parameters  of  the  network  (called  weights)  so  that  the  difference 
(error)  between  the  output  of  the  neural  net  and  the  desired  output  is  minimized.  The 
error  function  of  a  BP  neural  net  can  be  written  as 

=  4  E  !/(•*)  -  -BN. 
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where  /(»*)  is  the  correct  value  of  the  function  /(*)  at  **  ,  which  is  to  be  approximated 
by  the  neural  net.  fi(zjk,w)  is  the  output  of  the  BP  neural  net  at  .  JV^  is  the  total  number 
of  training  inputs  »* .  The  adjustment  of  the  network  parameter  vector  w,  called  the 
training  of  the  network,  is  carried  out  according  to  the  following  rule  based  on  the  deepest 
decent  principle: 


*=i 


(5) 


where  I  is  the  layer  level  index,  i  and  j  are  node  indices  on  level  I  or  I  -  1.  The  second 
term  on  the  right-hand  side  of  the  above  updating  rule  is  the  partial  derivative  of  the  error 
function  ER(w)  with  respect  to  the  weight  vector  component  WKi .  The  parameter  a  in  the 
updating  rule  is  called  the  learning  rate,  which  can  be  adjusted  to  improve  the  conver¬ 
gence  speed.  The  training  stops  when  the  difference  norm  of  the  weight  vector  is  below 
some  threshold.  It  is  often  that  the  training  process  is  very  time-consuming  and,  therefore, 
parallel  processing  can  be  applied  to  speed  up  the  training  process. 

It  is  fortunate  that  the  training  rule  (equation  5)  can  be  easily  parallelized  using  the 
data  partition  model.  In  particular,  we  can  let  each  processor  work  on  a  subset  of  the 
training  inputs  to  get  a  partial  sum  in  the  sum  for  the  partial  derivative  of  the  error 
function  with  respect  to  a  weight  component.  Then  a  global  sum  operation  is  performed  to 
compute  the  partial  derivative  and  the  sum  is  put  in  one  of  the  processors.  A  broadcast 
operation  can  then  be  performed  to  replicate  the  partial  derivative  to  all  processors.  Each 
processor  can  then  use  the  partial  derivative  to  update  the  weight  component.  The  advan¬ 
tage  of  this  parallel  implementation  of  a  BP  neural  net  training  is  its  simplicity.  The  code 
executed  on  each  processor  is  just  a  sequential  BP  net  training  code  plus  some  simple 
global  operations  that  can  be  implemented  very  efficiently.  Therefore,  a  good  perform¬ 
ance  speed-up  should  be  achieved  on  this  application. 

A  limitation  of  the  above  implementation  is  that  it  only  applies  to  the  training  rule 
(equation  5),  called  the  batch  mode  training.  For  some  applications,  it  has  been  found 
that  a  nonbatch  mode  training  rule  gives  a  much  faster  convergence  rate.  In  the  nonbatch 
mode  training,  the  weight  vector  w  updating  is  performed  for  each  input  rather  than 
waiting  until  the  contributions  from  all  inputs  are  accumulated.  To  parallelize  such  a 
training  process,  we  need  to  decompose  the  neural  net  structure  itself  and  let  each  proces¬ 
sor  contain  a  part  of  the  neural  net.  This  is  a  combination  of  data  and  function  partition 
since  each  input  data  vector  also  needs  to  be  distributed  among  processors.  Since  the 
processing  is  pipelined  (from  input  layer  to  output  layer),  it  is  crucial  to  have  a  high 
bandwidth  of  data  input  rate  to  the  parallel  system  so  that  the  input  operation  would  not 
become  a  bottleneck  of  the  whole  processing.  It  seems  inadequate  to  have  only  one 
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channel  for  external  I/O,  as  does  the  iWarp  system,  for  a  parallel  system  to  perform  this 
kind  of  processing. 


5  CONCLUSIONS 

We  have  discussed  in  this  report  several  mapping  strategies  for  the  implementations  of 
a  few  programming  tools,  mathematical  tools,  and  signal  and  image  processing  algo¬ 
rithms  on  the  iWarp  system.  Our  parallel  applications  development  is  based  on  the  data 
partition  MIMD  model,  which  we  think  is  a  natural  choice  for  these  applications  and  for 
the  iWarp  system.  These  mapping  strategies  should  apply  to  other  message-passing  MIMD 
systems  as  well.  For  example,  to  make  the  codes  we  developed  for  running  on  the  iWarp 
system  run  on  Intel’s  Touchstone  system,  one  only  needs  to  change  the  network  configura¬ 
tion  code  and  the  names  of  message-passing  functions. 

Good  performance  speed-ups  have  been  obtained  from  implementations  of  two- 
dimensional  FFT,  a  weighted  differencing  algorithm  for  multispectral  image  processing, 
and  some  matrix  factorization  algorithms.  It  was  also  shown,  from  algorithm  analysis 
and/or  from  actual  performance  data,  that  some  applications  like  solving  a  triangular 
linear  system  and  performing  a  singular  value  decomposition  of  a  matrix  is  more  difficult 
to  parallelize.  If  we  define  the  efficiency  of  a  parallel  application  as  the  ratio  of  speed-up 
over  the  ratio  of  numbers  of  processors  used,  the  efficiency  usually  improves  as  the  data 
size  of  the  application  increases.  For  example,  as  can  be  computed  from  figures  11 
and  12  for  using  16  processors  to  solve  a  triangular  linear  system,  the  efficiencies  are 
about  51%  and  62%  for  data  sizes  of  256  and  512,  respectively. 

When  mapping  a  composite  application  algorithm  to  a  parallel  system,  one  needs  to 
parallelize  different  parts  of  the  algorithm.  For  example,  a  moving  target  detection  algo¬ 
rithm  called  the  InfraRed-Search-and-Tracking  (IRST)  algorithm  may  consist  of  an  image 
registration  component,  a  weighted  differencing  component  and  a  three-dimensional 
matched  filtering  component.  Clearly,  the  parallel  application  developer  should  take  the 
mapping  of  the  whole  IRST  algorithm  as  an  integral  task  when  devising  a  mapping  strat¬ 
egy.  It  is  possible  that  a  mapping  strategy  is  efficient  for  one  component  but  is  not  effi¬ 
cient  for  another.  Thus  a  balance  of  efficiency  between  different  components  is  needed  to 
get  good  performance  of  the  whole  algorithm.  For  applications  discussed  in  this  report, 
we  assume  a  two-dimensional  data  array  is  partitioned  by  rows  or  columns,  and  a  matrix 
transpose  is  performed  when  the  data  distribution  is  not  suitable  for  a  parallel  operation 
on  the  data. 

Finally,  we  think  for  the  function  partition  model  like  pipelined  processing,  which  is 
popular  in  many  real-time  applications,  there  should  be  a  balance  between  the  external 
I/O  bandAvidth  to  the  parallel  system  and  the  processing  speed  within  the  parallel  system. 
The  iWarp  system  has  only  one  I/O  channel  connected  to  the  host  machine.  This  will 
become  a  performance  bottleneck  for  real-time,  pipelined  processing. 
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APPENDIX  A 


CODES 


A-1 


J^pendix  A:  main.h 


/* 

*  A  haad  fila  for  global  daclarabions . 

* 

*  Author:  John  Lou,  Coda  761 

*  Last  modifiad:  July  21,  1992 
*/ 


tincluda 

iincluda 

tincluda 

tincluda 

tincluda 

tincluda 

tincluda 

tincluda 

tincluda 

tincluda 


<sys/tiiiia .  h> 
<stdio.h> 
<math.h> 
<iwarp_cc . h> 
<gatas . h> 
<ragnums . h> 
<asin/gan_asffl .  h> 
<asm/pw_asm .  h> 
<pathlib/pl . h> 
<iwsys/gatcf g . h> 


typadaf  int  *ivactor; 
typadaf  float  *fvactor; 
typadaf  fvactor  *&Batrix; 

/*  Dafina  coamunication  constants  *! 

tdafina  la ft  0 

tdafina  right  1 

tdafina  up  2 

tdafina  down  3 

tdafina  naxjout  4 

tdafina  max_out_ring  2 

tdafina  elk  0 

tdafina  cclk  1 


/*  Function  dalarations  */ 
axtam  int  gatcfgO; 
char  *Bialloc(); 
void  BuiinO  ; 
void  8v_lds  0 ; 
void  nat_oonnact  ()  ; 
void  ring_mosh(); 
void  assign_^chan() ; 

/*  Othar  global  dafinitions 

int  _oallid,  sys_haight,  sysjwidth; 

int  haight,  width,  ncalls; 

int  rowid,  colid,  linid,  ringid; 

int  o_chan[Bia*__outI ,  i_chantB>ax_out  J ; 

int  o”ehan_ring[a»ax_out__ringl ,  T_chan_ring Iniax_out_ring]  ; 
int  nrows,  ncols; 


Appendix  A :  mesh_ring . c 


/* 

*  Sat  up  a  2-D  toroidal  mash  natwork  with  iridth*haig[ht 

*  iWarp  calls.  Call  a  ring  satup  function  to  anbad  a 

*  bidirectional  ring  in  the  2-D  mash.  Assume  the  number 

*  of  columns  of  iWarp  calls  is  even.  The  upper-left 

*  subarray  is  used. 

* 

*  Author:  John  Lou,  Coda  761 

*  Last  modified:  July  31,  1992 

* 

*1 


fincluda 

fdafina 

idafine 

fdafina 

fdafina 


’main.h" 

/*  head  i 

XR 

PL  DIR  XR 

XL 

PL  DIR  XL 

W 

PL_DIR_TO 

TO 

PL  DIR  TO 

/*  head  file  contains  global  dads  */ 


/*  Assign  2-D  mash  ID  to  each  call  */ 

void 

ay  ids  () 

{  “ 

rowid  a  _callid/sys_width; 
colid  ■  _callid  -  rowid*sys_width; 
linid  a  rowid*width  +  colid; 

)  /*  my^ids  */ 


void 

mash  ringO 

{ 

nat^coxuiact  ()  ; 

/*  Sat  vp  a  dockwisa  ring  */ 

if (! (rowid%2))  ( 

/*  even  rows  */ 

if (colid  >  0  colid  <  width  -  1)  ( 
o_chan_ring[dk]  a  o_chan  [ right  ] ; 
i_chan_ring(dk]  a  i_chan[laft] ; 

) 

if (colid  aa  0  it  Width  >  1)  ( 

o^chan_ring[dk]  a  o_chan  [  right  ] ; 
i  chan_ring(dk]  i_chan[up]; 

) 

if  (colid  aa  0  44  width  aa  1)  ( 
o_chan_ring[dk]  a  o_chan[down] ; 
i  cban_ring[dk]  a  i_chan[up]; 

) 

if (colid  aa  width  -  1  width  >  1)  ( 
o^chan_ring[dk]  a  o_chan[down]  ; 
i  chan_ring [elk]  a  i_chan{laft] ; 

> 

ringid  a  linid; 

) 

else  ( 

/*  odd  rows  •/ 

if (colid  >  0  it  eolid  <  width  -  1)  ( 
o_ohan_ring [elk]  a  o_cban[laft] ; 
i^ehan  ring[dk]  a  l_chan  [right]  ; 

) 
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J^pendix  A :  mesh_rlng . c 

if (colid  0  ££  width  >  1)  ( 

o_chan_ring[clk]  =  o_chan[down] ; 
i_chan_ring [elk]  -  i_chan [ right ] ; 

) 

if (colid  »  0  £&  width  ==  1)  ( 
o_chan_ring [elk]  s  o_chan[do%m]; 
i_chan_ring [elk]  =  i_chan[up]; 

} 

if (colid  ss  width  -  1  fiC  width  >  1)  { 
o_ehan_ring [elk]  «  o_chan[left] ; 
i_ehan_ring [elk]  =  i_ehan [up] ; 

) 

ringid  >  linid  -  colid  -  1  +  width  -  colid; 

) 

rwtum; 

)  /*  ring_iMsh  */ 


void 

not  connoct () 

{ 

int  dost,  hoadar,  8rc_id[2],  i_chan_taiqp; 
int  i; 

/*  Initialisa  PathLib  */ 
pl_init (OxOf f f f ) ; 

/*  Assign  PCTs  for  racaiving  calls  */ 

pl_xpa_configura (0x00001,  0x00002,  0x00004,  0x00008,  OxOOOfO)  ; 

if  (_callid  u  sys_width*sys_haight  |  | 

colid  >«  width  4  I  rowid  >«  haight)  axit (0) ; 

/*  initialising  handlas  */ 
for(iw0;  ioiiax_out,‘  i++)  ( 
o_chan[i]  ■  -1; 
i_chan[i]  ■  -2; 

) 

/* 

*  Csaata  oonnactions  to  othar  calls 

*  for  a  ragular  2-0  wrap-around  stash. 

*/ 

/*  to  laft  */ 
if(eolid  !-  0) 
dast  -  _callid  -  1; 

alsa 

dast  B  _callid  +  width  -  1; 
o_chan[laft]  ■■  pl_sand_oc(PL_GATK0,  XL,  Sdast,  1); 
pl_sandi (PL_(SAIE0,  right); 
pl_sandi(FL_GAIX0,  _callid  1) ; 

/*  to  right  */ 
if (colid  l«  width  -  1) 
dast  ■  _callid  +  1; 
alsa 

dast  -  _eallid  -  width  +  1; 
o_chan [right]  *  pl_sand_oc(PLjSATE0,  XR,  tdast,  1); 
pl_san^  (PL  GAIEO,  laft) ; 
pl_sandi(PLjaiK0,  -(_oallid  +  1) )  ; 

/*  to  */ 
if  (rowid  !<■  0) 
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Appendix  A :  inesh__ring .  c 

d«st  «  _callid  -  sya_width; 

•Isa 

dast  «  _callid  +  ays_jiridth*  (baighfc  -  1)  ; 
o_chan[up]  =  pl_aand_oc(PL_GATEO,  YU,  tdast,  1); 
pl_8andi (PL_GATEO,  down) ; 
pl_sandi(PL_GATEO,  _callid  4-  1); 

/*  to  down  */ 
i£(rowid  |s  height  -  1) 

dast  w  _callid  sys_width; 
alsa 

dast  a  _callid  ~  sys_width* (height  -  1) ; 
o_chan[down]  ■  pl_aand_oc(PI._GATEO,  YD,  Cdast,  1) ; 
pl_aandi (PL_GATEO ,  up); 
pl_sandi  (PL_GA.TEO,  -(_callid  -t-  1)  ) ; 

/* 

*  accept  four  connections  from  other  cells 
*/ 

i_chan_temp  >  pl_recv_oc(PL_GATEl,  (header); 
src_id[0]  ■  pl_recvi  (PljSMEl) ; 
src_id[l]  a  pl_recvi (PLjSAIEl) ; 
assign_chan(i_^chan,  i_chan_teinp,  src_id) ; 
i_chan_teap  a  pl_recv_oe  (PL_jSATEl ,  (header) ; 
src  id[0]  a  pl_recvi (PL_GAIC1) ; 
arc“ld(ll  a  pl_recvi  (PLjSJkTEl) ; 
asaign_chan(i_jchan,  i_chan_teiap,  src_id) ; 
i_chan_teap  a  pl__recv_oc  (PLjSXTEl ,  Oieader) ; 
src_id[0]  a  pl^reevi (PLjSAIEl) ; 
arc__id(l]  •  pl_recvi  (PL_6ArEl) ; 
assign_chan(i_chan,  i_chan_teinp,  src_id) ; 
i  chan  teiif>  a  pi  recv  oc(PL  GATEl,  (header); 
src  idTO]  a  pi  recwi  (PL  GAacil)  ; 
sre'idd]  •  pl"recvi  (PLjadBl)  ; 
assign_chan(i_chan,  ijchanjteap,  •rc_id); 

)  /*  msh  connt  */ 


/* 

*  Assign  coining-connection  handles  to  proper 

*  array  components. 

*/ 

void 

assign_chan (i_chan,  i_chan_temp,  sid) 
int  i_chan[],  i_chan_teii^; 
int  sid[]; 

{ 

/* 

*  Accept  connections  from  iWarp  cells 
*/ 

/*  from  left  */ 

if((sid(l]  a.  -(_oellid)  ||  siddl  a-  -(  oellid  +  width)) 
((sid(0]  a.  left)  ~ 

(i_chan(left]  a  i_chan_jbes9;  return;) 

/*  from  right  */ 

if((sid(l]  -a  oellid  +  2  ||  sid[l]  a.  _cellid  -  width  4-  2) 
((  sid[0}  n  right) 

(i_ohan[ right]  a  i_chan_teap;  return; ) 

/*  from  */ 

lf((sid[ll  aa  -(_oellid  -  sysjwidth  4-  1)  1 1  aid[ll  aa 

-(_oellid  4-  ays jwidth*  (height-1)  4-1))  ((  sid[0]  a.  up) 
(i_ehan[up]  a  i_jabanjtemp;  return; ) 

/*  from  down  */ 
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i£((ai.d[l]  ■»  _callid  +  aysjwidth  +  111  aid[l]  ■■ 
callid  —  aya_width* (haight— 1)  +  1)  *6  aid[0] 
<i^chan[doim]  ■  i_chan_taBqp;  xatum; ) 
ratum; 

)  /*  aaaign_eban  */ 


i^pendix  A :  main .  c 


/* 

*  An  iWarp  call  program. 

* 

*  A  main  program  showing  how  ho  usa  parallal  programming  tools 

*  to  do  a  walghtad  dlffaranclng  o£  Imagas  using  an  IWarp  array. 

* 

*  Author:  John  Iau,  Coda  761 

*  Last  modlflad:  July  22,  1992 
*/ 


ilncluda 

"main.h" 

idaflna 

mimag 

2 

/* 

fdaflna 

4  fwiiylg  m 

128 

/* 

idaflna 

arydlm 

8 

/* 

idaflna 

void 

mainO 

{ 

struct 

nrow_call 

16 

/* 

Iwcfg  cfg; 

finatrlx  imag[mimag] 
Imatrlx  nrow  loc; 

i 

Int 

1,  j,  ct  -  0; 

float 

cf ,  mo  [fflimag] 

,  axa^t. 

/*  gat  IWaxp  array  configuration  */ 
gatcfg (6efg)  ; 


/* - Sat  up  IWarp  array  natwork - */ 

_callld  w  cfg.callld; 

sysjhalght  «  cfg.halght,  8ys_width  <■  cfg. width; 
halght  *  sys_halght/2 ,  width  >  sys_wldth/2; 
ncalls  ••  halght*wldth; 

/*  Znltlallsa  rowld  and  colld  */ 

■y_ld8  0 ; 

/*  Zixltlallsa  Pathlilb  and  sat  iqp  a  mash  connactlon  */ 
mash_rlng() ; 

- £nd  of  natwork  satup - */ 


/* 

*  Tha  walghtad  dlffaranclng  schams  contains  tha 

*  following  stops: 

*  1)  soattar  tha  2-0  data  array  onto  tha  IWarp  array. 

*  2)  narmallsa  aach  Una  of  Imaga  data. 

*  3)  oos^puta  corralatlon  coafficlant  of  a  pair  of  imagas. 

*  4)  coi^uta  walghtad  dlffaranca. 

*/ 


nrow_loc  >  (imatrlx)  smlloc (sisaof (Ivactor) *halght) ; 
for(lwO;  Khalght;  1-4-f) 
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nrow__loc[i]  >  (ivttctor)  iiialloc(siEaof  (int)  *width) ; 

/*  scatter  input  imagas  */ 
for(i«0;  Kmiinag;  i-H-)  { 

in>ag[i]  «  (fmatrix)  nallocCsizaof  (fvacbor)  *nrow_cell) ; 
for(jB0;  j<nrow_cell;  j++) 

i>Bag[i]  [  j]  *  (float  *)  malloc  (sizaof  (float)  *iiiiagdiin) ; 
scattar  (iinag[i] ,  nro«r  loc,  imagdim,  imagdiin) ; 

> 

/*  noxnaliza  aach  lina  of  imaga  data  */ 
for(iaO;  Knimag;  i-H-) 

noziiial_lina  (iiiiag[i] ,  nEow__loc[rowid]  [colid] ,  imagdim); 

/*  coo^ta  corralation  coafficiant  and  sum  */ 

/*  computa  avaraga  *l 
for(i«0;  i-Onimag;  i-H-)  { 

ma[i]  s  sumf  (imagti]  ,  nrow_loe[Eowid]  [colid] ,  imagdim); 
gauBf (cma[i],  1,  0,  0)  ; 

) 

if  (_oallid— 0)  { 

for(ia0;  i<mimag;  i-H-)  ma[i]  ■  mo  [i]  /  (imagdim*imagdim) ; 

} 

graplicf(ma,  2); 

/*  coo^to  covarianca  */ 

cf  a  covar  (imag[0] ,  imag[l],  nEOw__loc[rowid]  [colid] , 
imagdim,  m[0],  flM[l])7 
gsumf(<cf,  1,  0,  0); 
gzaplicf (6cf,  1); 

/*  parform  waightad  diffarancing  */ 

wdif  (iaiag[0] ,  imag[l],  nroir_loc[rowid]  [colid] ,  imagdim,  cf ) ; 

/*  gathar  and  output  waightad,  difforancad  imago  */ 
gathar (imag[0] ,  nrow_loc,  imagdim); 


•xit  (0); 

)  /*  main  */ 
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